1 Where Do People Drink The Most Beer, Wine And Spirits?

Back in 2014, fivethiryeight.com published an article on alchohol consumption in different countries. The data drinks is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.

library(fivethirtyeight)
data(drinks)

# or download directly
# alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")

What are the variable types? Any missing values we should worry about?

No missing values.

Column name country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
Variable type character integer integer integer double
glimpse(drinks)
## Rows: 193
## Columns: 5
## $ country                      <chr> "Afghanistan", "Albania", "Algeria", "And…
## $ beer_servings                <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, 2…
## $ spirit_servings              <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75,…
## $ wine_servings                <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 191…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8, …
skim(drinks)
Data summary
Name drinks
Number of rows 193
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 28 0 193 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
beer_servings 0 1 106.16 101.14 0 20.0 76.0 188.0 376.0 ▇▃▂▂▁
spirit_servings 0 1 80.99 88.28 0 4.0 56.0 128.0 438.0 ▇▃▂▁▁
wine_servings 0 1 49.45 79.70 0 1.0 8.0 59.0 370.0 ▇▁▁▁▁
total_litres_of_pure_alcohol 0 1 4.72 3.77 0 1.3 4.2 7.2 14.4 ▇▃▅▃▁

Make a plot that shows the top 25 beer consuming countries

drinks %>% 
  slice_max ( order_by = beer_servings, n=25 ) %>% # taking top 25 countries by servings
  ggplot(aes(x = beer_servings, y = fct_reorder(country, beer_servings))) +
  geom_col(fill="orange") +
  labs(
    title = "Top 25 Beer Serving Countries in 2010",
    subtitle = "Standard Servings Per Person",
    x = "Beer Servings (in cans)",
    y = "Country"
  )

Make a plot that shows the top 25 wine consuming countries

drinks %>% 
  slice_max ( order_by = wine_servings, n=25 ) %>% # taking top 25 countries by servings
  ggplot(aes(x = wine_servings, y = fct_reorder(country, wine_servings))) +
  geom_col(fill="dark red") +
  labs(
    title = "Top 25 Wine Serving Countries in 2010",
    subtitle = "Standard Servings Per Person",
    x = "Wine Servings (in glasses)",
    y = "Country"
  )

Finally, make a plot that shows the top 25 spirit consuming countries

drinks %>% 
  slice_max ( order_by = spirit_servings, n=25 ) %>% # taking top 25 countries by servings
  ggplot(aes(x = spirit_servings, y = fct_reorder(country, spirit_servings))) +
  geom_col(fill="grey") +
  labs(
    title = "Top 25 Spirit Serving Countries in 2010",
    subtitle = "Servings (in shots) Per Person",
    x = "Spirit Servings",
    y = "Country"
  )

What can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).

TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.

  1. European countries are high consumers of wine.

  2. Beer is more evenly distributed around the world in the top 25 countries, as compared to wine and spirit.

  3. European countries are higher ranked for overall consumption of beer, wine and spirit.

2 Analysis of movies- IMDB dataset

We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

  • gross : The gross earnings in the US box office, not adjusted for inflation
  • budget: The movie’s budget
  • cast_facebook_likes: the number of facebook likes cast members received
  • votes: the number of people who voted for (or rated) the movie in IMDB
  • reviews: the number of reviews for that movie
  • rating: IMDB average rating

2.1 Use your data import, inspection, and cleaning skills to answer the following:

  • Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
# no missing values. There are Duplicates (2907 distinct titles in 2961 rows). 
skim(movies)
Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁
# show the duplicate movies
movies %>% count(title, sort=T)
## # A tibble: 2,907 × 2
##    title                           n
##    <chr>                       <int>
##  1 Home                            3
##  2 A Nightmare on Elm Street       2
##  3 Across the Universe             2
##  4 Alice in Wonderland             2
##  5 Aloha                           2
##  6 Around the World in 80 Days     2
##  7 Brothers                        2
##  8 Carrie                          2
##  9 Chasing Liberty                 2
## 10 Cinderella                      2
## # … with 2,897 more rows
# to see what happens with the duplicates
movies %>% filter(title=="Homes")
## # A tibble: 0 × 11
## # … with 11 variables: title <chr>, genre <chr>, director <chr>, year <dbl>,
## #   duration <dbl>, gross <dbl>, budget <dbl>, cast_facebook_likes <dbl>,
## #   votes <dbl>, reviews <dbl>, rating <dbl>
# `distinct` function can only keep the first entry but not latest
# movies <- distinct(movies, title, .keep_all=T)
length(unique(movies$title))
## [1] 2907
movies <- movies %>% 
  group_by(title) %>% 
  filter(votes == max(votes)) %>%
  ungroup()

# there are still duplicates
movies %>% count(title, sort=T)
## # A tibble: 2,907 × 2
##    title                          n
##    <chr>                      <int>
##  1 Chasing Liberty                2
##  2 10 Cloverfield Lane            1
##  3 10 Days in a Madhouse          1
##  4 10 Things I Hate About You     1
##  5 102 Dalmatians                 1
##  6 10th & Wolf                    1
##  7 12 Rounds                      1
##  8 12 Years a Slave               1
##  9 127 Hours                      1
## 10 13 Going on 30                 1
## # … with 2,897 more rows
# to see what happens with the duplicates
movies %>% filter(title=="Chasing Liberty")
## # A tibble: 2 × 11
##   title    genre  director   year duration   gross budget cast_facebook_l… votes
##   <chr>    <chr>  <chr>     <dbl>    <dbl>   <dbl>  <dbl>            <dbl> <dbl>
## 1 Chasing… Comedy Andy Cad…  2004      101  1.22e7  2.3e7              842 30092
## 2 Chasing… Comedy Andy Cad…  2004      101  1.22e7  2.3e7              829 30092
## # … with 2 more variables: reviews <dbl>, rating <dbl>
# do the filter only for the entries of Chasing Liberty 
movies <- movies %>%
  group_by(title) %>% 
  filter(cast_facebook_likes==max(cast_facebook_likes)) %>%
  ungroup()

skim(movies)
Data summary
Name movies
Number of rows 2907
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.92e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.23e+01 37.0 9.50e+01 1.05e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.76e+07 7.23e+07 703.0 1.20e+07 3.45e+07 7.51e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.02e+07 4.32e+07 218.0 1.10e+07 2.50e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.23e+04 2.05e+04 0.0 2.22e+03 4.54e+03 1.68e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.59e+05 5.0 1.95e+04 5.47e+04 1.32e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 4.98e+02 4.93e+02 2.0 1.97e+02 3.58e+02 6.24e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.06e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁
  1. There are no missing values in the dataset.
  2. The movies with duplicated entries and how we cleaned them are shown as above.
  • Produce a table with the count of movies by genre, ranked in descending order
movies %>% count(genre, sort = TRUE)
## # A tibble: 17 × 2
##    genre           n
##    <chr>       <int>
##  1 Comedy        844
##  2 Action        719
##  3 Drama         484
##  4 Adventure     281
##  5 Crime         198
##  6 Biography     135
##  7 Horror        128
##  8 Animation      35
##  9 Fantasy        26
## 10 Documentary    25
## 11 Mystery        15
## 12 Sci-Fi          7
## 13 Family          3
## 14 Musical         2
## 15 Romance         2
## 16 Western         2
## 17 Thriller        1
  • Produce a table with the average gross earning and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending order
movies %>% 
  mutate(movies_return = gross/budget ) %>%
  group_by(genre) %>%
  summarise(avg_gross = mean(gross),
            avg_budget = mean(budget),
            genre_return_on_budget = sum(gross)/sum(budget),
            movie_mean_return_on_budget = mean(movies_return)) %>%
  arrange(-movie_mean_return_on_budget)
## # A tibble: 17 × 5
##    genre        avg_gross avg_budget genre_return_on_budget movie_mean_return_o…
##    <chr>            <dbl>      <dbl>                  <dbl>                <dbl>
##  1 Horror       37782310.  13804379.                2.74                86.1    
##  2 Biography    45201805.  28543696.                1.58                22.3    
##  3 Musical      92084000    3189500                28.9                 18.8    
##  4 Family      149160478.  14833333.               10.1                 14.1    
##  5 Documentary  17353973.   5887852.                2.95                 8.70   
##  6 Western      20821884    3465000                 6.01                 7.06   
##  7 Fantasy      41902674.  18484615.                2.27                 6.10   
##  8 Animation    98433792.  61701429.                1.60                 5.01   
##  9 Comedy       42487808.  24458506.                1.74                 3.70   
## 10 Romance      31264848.  25107500                 1.25                 3.17   
## 11 Drama        36754959.  25832605.                1.42                 2.98   
## 12 Mystery      69117136.  41500000                 1.67                 2.90   
## 13 Adventure    94350236.  64692313.                1.46                 2.44   
## 14 Crime        37601525.  26527405.                1.42                 2.19   
## 15 Action       86270343.  70774558.                1.22                 1.93   
## 16 Sci-Fi       29788371.  27607143.                1.08                 1.58   
## 17 Thriller         2468     300000                 0.00823              0.00823
  • Produce a table that shows the top 15 directors who have created the highest gross revenue in the box office. Don’t just show the total gross amount, but also the mean, median, and standard deviation per director.
movies %>%
  group_by(director) %>%
  summarise(total_gross = sum(gross),
            mean_gross = mean(gross),
            median_gross = median(gross),
            standard_dev_gross = sd(gross)) %>%
  slice_max ( order_by = total_gross, n = 15)
## # A tibble: 15 × 5
##    director          total_gross mean_gross median_gross standard_dev_gross
##    <chr>                   <dbl>      <dbl>        <dbl>              <dbl>
##  1 Steven Spielberg   4014061704 174524422.   164435221          101421051.
##  2 Michael Bay        2195443511 182953626.   168468240.         125789167.
##  3 James Cameron      1909725910 318287652.   175562880.         309171337.
##  4 Christopher Nolan  1813227576 226653447    196667606.         187224133.
##  5 George Lucas       1741418480 348283696    380262555          146193880.
##  6 Robert Zemeckis    1619309108 124562239.   100853835           91300279.
##  7 Tim Burton         1557078534 111219895.    69791834           99304293.
##  8 Sam Raimi          1443167519 180395940.   138480208          174705230.
##  9 Clint Eastwood     1378321100  72543216.    46700000           75487408.
## 10 Francis Lawrence   1358501971 271700394.   281666058          135437020.
## 11 Ron Howard         1335988092 111332341    101587923           81933761.
## 12 Gore Verbinski     1329600995 189942999.   123207194          154473822.
## 13 Andrew Adamson     1137446920 284361730    279680930.         120895765.
## 14 Shawn Levy         1129750988 102704635.    85463309           65484773.
## 15 Ridley Scott       1128857598  80632686.    47775715           68812285.
  • Finally, ratings. Produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.
movies_rating <- movies %>%
  group_by(genre) %>%
  summarise(mean_rating = mean(rating),
            min_rating = min(rating),
             max_rating = max(rating),
             sd_rating = sd(rating)) 
movies_rating
## # A tibble: 17 × 5
##    genre       mean_rating min_rating max_rating sd_rating
##    <chr>             <dbl>      <dbl>      <dbl>     <dbl>
##  1 Action             6.23        2.1        9       1.04 
##  2 Adventure          6.51        2.3        8.6     1.11 
##  3 Animation          6.65        4.5        8       0.968
##  4 Biography          7.11        4.5        8.9     0.760
##  5 Comedy             6.11        1.9        8.8     1.02 
##  6 Crime              6.92        4.8        9.3     0.853
##  7 Documentary        6.66        1.6        8.5     1.77 
##  8 Drama              6.74        2.1        8.8     0.915
##  9 Family             6.5         5.7        7.9     1.22 
## 10 Fantasy            6.08        4.3        7.9     0.953
## 11 Horror             5.79        3.6        8.5     0.987
## 12 Musical            6.75        6.3        7.2     0.636
## 13 Mystery            6.84        4.6        8.5     0.910
## 14 Romance            6.65        6.2        7.1     0.636
## 15 Sci-Fi             6.66        5          8.2     1.09 
## 16 Thriller           4.8         4.8        4.8    NA    
## 17 Western            5.7         4.1        7.3     2.26
movies %>%
  ggplot(mapping = aes(x = rating)) + 
  geom_histogram(bins=30) +
  facet_wrap(~genre)+
  labs(title = "Distribution of ratings in each genre",
       x = "Rating (1-10)",
       y = "Num of movies")

  NULL
## NULL

2.2 Use ggplot to answer the following

  • Examine the relationship between gross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?
ggplot(movies, aes(x = cast_facebook_likes, y = gross)) +
  geom_point() +
  geom_smooth(method = "lm")+
  NULL

  1. Facebook likes do not seem like a good indicator of the gross as there is no direct correlation as seen from the scatter plot.
  2. We mapped gross to Y axes and number of facebook likes to X, because the gross is the final outcome of a movie, aka dependent variable.
  • Examine the relationship between gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.
ggplot(movies, aes(x = budget , y = gross)) +
  geom_point() +
  geom_smooth(method = "lm")

  • The budget and gross do seem correlated. The higher the budget, it is more likely that the gross may be higher.

  • Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?

ggplot(movies, aes(x = rating , y = gross)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~genre) +
  labs(title = "Gross vs Rating of Movies For Each Genre ",
       x = "Rating",
       y = "Gross") +
  NULL

We can see that:

  • The higher the rating the more will be the gross for the most genres of movies.

  • For movies of some genres like ‘Documentary’, ‘Mystery’, ‘Horror’ and ‘Sci-Fi’, the gross has a very less change with respect to rating. Documentaries certainly have a different business model.

  • Negative correlation even appears.

3 Returns of financial stocks

You may find useful the material on finance data sources.

We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.

We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.

nyse <- read_csv(here::here("data","nyse.csv"))

glimpse(nyse)
## Rows: 508
## Columns: 6
## $ symbol        <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", "…
## $ name          <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie …
## $ ipo_year      <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "1999…
## $ sector        <chr> "Health Care", "Consumer Durables", "Health Care", "Heal…
## $ industry      <chr> "Medical/Dental Instruments", "Electrical Products", "Ma…
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasdaq…
nyse$sector
##   [1] "Health Care"           "Consumer Durables"     "Health Care"          
##   [4] "Health Care"           "Miscellaneous"         "Consumer Services"    
##   [7] "Finance"               "Capital Goods"         "Basic Industries"     
##  [10] "Basic Industries"      "Health Care"           "Consumer Services"    
##  [13] "Miscellaneous"         "Finance"               "Health Care"          
##  [16] "Finance"               "Finance"               "Consumer Services"    
##  [19] "Consumer Non-Durables" "Consumer Non-Durables" "Consumer Durables"    
##  [22] "Public Utilities"      "Public Utilities"      "Public Utilities"     
##  [25] "Public Utilities"      "Finance"               "Finance"              
##  [28] "Consumer Services"     "Public Utilities"      "Finance"              
##  [31] "Health Care"           "Capital Goods"         "Consumer Durables"    
##  [34] "Consumer Non-Durables" "Consumer Services"     "Health Care"          
##  [37] "Finance"               "Finance"               "Capital Goods"        
##  [40] "Consumer Services"     "Basic Industries"      "Consumer Non-Durables"
##  [43] "Capital Goods"         "Technology"            "Finance"              
##  [46] "Health Care"           "Public Utilities"      "Public Utilities"     
##  [49] "Technology"            "Consumer Services"     "Consumer Services"    
##  [52] "Public Utilities"      "Finance"               "Energy"               
##  [55] "Consumer Durables"     "Finance"               "Finance"              
##  [58] "Finance"               "Finance"               "Finance"              
##  [61] "Finance"               "Finance"               "Finance"              
##  [64] "Finance"               "Finance"               "Finance"              
##  [67] "Finance"               "Finance"               "Basic Industries"     
##  [70] "Health Care"           "Finance"               "Public Utilities"     
##  [73] "Health Care"           "Consumer Services"     "Basic Industries"     
##  [76] "Energy"                "Capital Goods"         "Finance"              
##  [79] "Capital Goods"         "Consumer Services"     "Consumer Services"    
##  [82] "Health Care"           "Energy"                "Health Care"          
##  [85] "Consumer Non-Durables" "Miscellaneous"         "Consumer Services"    
##  [88] "Consumer Services"     "Finance"               "Public Utilities"     
##  [91] "Consumer Services"     "Energy"                "Consumer Services"    
##  [94] "Consumer Non-Durables" "Finance"               "Transportation"       
##  [97] "Energy"                "Transportation"        "Miscellaneous"        
## [100] "Finance"               "Health Care"           "Consumer Durables"    
## [103] "Consumer Services"     "Consumer Services"     "Consumer Durables"    
## [106] "Capital Goods"         "Finance"               "Consumer Services"    
## [109] "Basic Industries"      "Energy"                "Health Care"          
## [112] "Public Utilities"      "Public Utilities"      "Public Utilities"     
## [115] "Basic Industries"      "Consumer Services"     "Energy"               
## [118] "Consumer Services"     "Finance"               "Public Utilities"     
## [121] "Energy"                "Public Utilities"      "Public Utilities"     
## [124] "Consumer Services"     "Finance"               "Public Utilities"     
## [127] "Basic Industries"      "Health Care"           "Finance"              
## [130] "Finance"               "Consumer Durables"     "Public Utilities"     
## [133] "Finance"               "Capital Goods"         "Consumer Non-Durables"
## [136] "Consumer Non-Durables" "Consumer Non-Durables" "Consumer Non-Durables"
## [139] "Consumer Non-Durables" "Energy"                "Energy"               
## [142] "Public Utilities"      "Consumer Non-Durables" "Energy"               
## [145] "Basic Industries"      "Consumer Non-Durables" "Finance"              
## [148] "Finance"               "Capital Goods"         "Consumer Services"    
## [151] "Energy"                "Health Care"           "Capital Goods"        
## [154] "Capital Goods"         "Consumer Services"     "Capital Goods"        
## [157] "Technology"            "Transportation"        "Finance"              
## [160] "Consumer Non-Durables" "Consumer Services"     "Finance"              
## [163] "Consumer Services"     "Public Utilities"      "Technology"           
## [166] "Basic Industries"      "Public Utilities"      "Public Utilities"     
## [169] "Consumer Services"     "Basic Industries"      "Technology"           
## [172] "Basic Industries"      "Energy"                "Public Utilities"     
## [175] "Health Care"           "Health Care"           "Health Care"          
## [178] "Energy"                "Energy"                "Public Utilities"     
## [181] "Public Utilities"      "Energy"                "Public Utilities"     
## [184] "Public Utilities"      "Energy"                "Technology"           
## [187] "Finance"               "Energy"                "Consumer Services"    
## [190] "Consumer Services"     "Consumer Services"     "Consumer Non-Durables"
## [193] "Finance"               "Public Utilities"      "Public Utilities"     
## [196] "Public Utilities"      "Consumer Services"     "Energy"               
## [199] "Technology"            "Miscellaneous"         "Consumer Services"    
## [202] "Transportation"        "Capital Goods"         "Capital Goods"        
## [205] "Finance"               "Miscellaneous"         "Public Utilities"     
## [208] "Miscellaneous"         "Basic Industries"      "Consumer Non-Durables"
## [211] "Capital Goods"         "Public Utilities"      "Capital Goods"        
## [214] "Basic Industries"      "Finance"               "Basic Industries"     
## [217] "Health Care"           "Consumer Services"     "Capital Goods"        
## [220] "Energy"                "Consumer Non-Durables" "Capital Goods"        
## [223] "Capital Goods"         "Health Care"           "Miscellaneous"        
## [226] "Finance"               "Technology"            "Finance"              
## [229] "Energy"                "Finance"               "Health Care"          
## [232] "Consumer Services"     "Finance"               "Capital Goods"        
## [235] "Consumer Non-Durables" "Energy"                "Technology"           
## [238] "Consumer Services"     "Consumer Services"     "Capital Goods"        
## [241] "Capital Goods"         "Consumer Non-Durables" "Consumer Services"    
## [244] "Technology"            "Finance"               "Health Care"          
## [247] "Finance"               "Capital Goods"         "Technology"           
## [250] "Technology"            "Technology"            "Finance"              
## [253] "Capital Goods"         "Finance"               "Consumer Services"    
## [256] "Technology"            "Basic Industries"      "Basic Industries"     
## [259] "Finance"               "Health Care"           "Finance"              
## [262] "Finance"               "Consumer Non-Durables" "Basic Industries"     
## [265] "Health Care"           "Consumer Services"     "Transportation"       
## [268] "Finance"               "Consumer Non-Durables" "Consumer Non-Durables"
## [271] "Finance"               "Capital Goods"         "Consumer Durables"    
## [274] "Public Utilities"      "Finance"               "Energy"               
## [277] "Public Utilities"      "Consumer Services"     "Capital Goods"        
## [280] "Health Care"           "Consumer Non-Durables" "Consumer Services"    
## [283] "Technology"            "Basic Industries"      "Finance"              
## [286] "Basic Industries"      "Consumer Services"     "Finance"              
## [289] "Capital Goods"         "Finance"               "Consumer Services"    
## [292] "Basic Industries"      "Finance"               "Energy"               
## [295] "Capital Goods"         "Finance"               "Energy"               
## [298] "Energy"                "Finance"               "Finance"              
## [301] "Basic Industries"      "Basic Industries"      "Miscellaneous"        
## [304] "Consumer Non-Durables" "Consumer Services"     "Health Care"          
## [307] "Health Care"           "Health Care"           "Finance"              
## [310] "Capital Goods"         "Consumer Services"     "Consumer Services"    
## [313] "Finance"               "Finance"               "Consumer Non-Durables"
## [316] "Finance"               "Finance"               "Technology"           
## [319] "Energy"                "Miscellaneous"         "Public Utilities"     
## [322] "Consumer Services"     "Basic Industries"      "Public Utilities"     
## [325] "Consumer Non-Durables" "Public Utilities"      "Energy"               
## [328] "Technology"            "Finance"               "Transportation"       
## [331] "Capital Goods"         "Consumer Services"     "Health Care"          
## [334] "Health Care"           "Basic Industries"      "Basic Industries"     
## [337] "Capital Goods"         "Energy"                "Technology"           
## [340] "Public Utilities"      "Technology"            "Public Utilities"     
## [343] "Finance"               "Technology"            "Technology"           
## [346] "Capital Goods"         "Technology"            "Energy"               
## [349] "Energy"                "Energy"                "Health Care"          
## [352] "Consumer Non-Durables" "Energy"                "Energy"               
## [355] "Public Utilities"      "Technology"            "Energy"               
## [358] "Energy"                "Finance"               "Basic Industries"     
## [361] "Basic Industries"      "Public Utilities"      "Basic Industries"     
## [364] "Finance"               "Consumer Services"     "Finance"              
## [367] "Finance"               "Public Utilities"      "Public Utilities"     
## [370] "Consumer Services"     "Health Care"           "Finance"              
## [373] "Capital Goods"         "Consumer Services"     "Finance"              
## [376] "Consumer Services"     "Public Utilities"      "Health Care"          
## [379] "Consumer Services"     "Technology"            "Basic Industries"     
## [382] "Capital Goods"         "Consumer Services"     "Finance"              
## [385] "Capital Goods"         "Finance"               "Finance"              
## [388] "Consumer Services"     "Finance"               "Technology"           
## [391] "Technology"            "Energy"                "Energy"               
## [394] "Consumer Services"     "Public Utilities"      "Technology"           
## [397] "Consumer Services"     "Finance"               "Technology"           
## [400] "Consumer Services"     "Public Utilities"      "Consumer Services"    
## [403] "Health Care"           "Technology"            "Consumer Non-Durables"
## [406] "Public Utilities"      "Basic Industries"      "Transportation"       
## [409] "Consumer Services"     "Public Utilities"      "Technology"           
## [412] "Capital Goods"         "Finance"               "Health Care"          
## [415] "Technology"            "Health Care"           "Finance"              
## [418] "Consumer Services"     "Finance"               "Energy"               
## [421] "Finance"               "Finance"               "Consumer Non-Durables"
## [424] "Technology"            "Health Care"           "Consumer Services"    
## [427] "Consumer Services"     "Public Utilities"      "Technology"           
## [430] "Energy"                "Capital Goods"         "Health Care"          
## [433] "Public Utilities"      "Public Utilities"      "Public Utilities"     
## [436] "Basic Industries"      "Consumer Services"     "Capital Goods"        
## [439] "Basic Industries"      "Finance"               "Finance"              
## [442] "Health Care"           "Finance"               "Capital Goods"        
## [445] "Consumer Services"     "Consumer Services"     "Consumer Services"    
## [448] "Finance"               "Energy"                "Miscellaneous"        
## [451] "Capital Goods"         "Capital Goods"         "Finance"              
## [454] "Technology"            "Technology"            "Consumer Non-Durables"
## [457] "Finance"               "Technology"            "Finance"              
## [460] "Consumer Services"     "Public Utilities"      "Basic Industries"     
## [463] "Basic Industries"      "Transportation"        "Transportation"       
## [466] "Capital Goods"         "Health Care"           "Health Care"          
## [469] "Consumer Non-Durables" "Basic Industries"      "Energy"               
## [472] "Technology"            "Consumer Services"     "Public Utilities"     
## [475] "Consumer Services"     "Miscellaneous"         "Public Utilities"     
## [478] "Technology"            "Consumer Services"     "Basic Industries"     
## [481] "Consumer Services"     "Finance"               "Consumer Services"    
## [484] "Consumer Services"     "Consumer Services"     "Public Utilities"     
## [487] "Public Utilities"      "Capital Goods"         "Consumer Services"    
## [490] "Public Utilities"      "Health Care"           "Finance"              
## [493] "Consumer Services"     "Basic Industries"      "Public Utilities"     
## [496] "Capital Goods"         "Finance"               "Consumer Services"    
## [499] "Basic Industries"      "Public Utilities"      "Technology"           
## [502] "Technology"            "Capital Goods"         "Consumer Services"    
## [505] "Consumer Services"     "Health Care"           "Health Care"          
## [508] "Transportation"

Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order

# a easier way
# nyse %>% 
#   select(sector) %>% 
#   table() %>%
#   sort(decreasing = T) %>%
#   barplot()

nyse %>%
  group_by(sector) %>% 
  mutate(company_num = count(sector)) %>%
  ggplot(aes(x=company_num, y=fct_reorder(sector, company_num))) +
  geom_bar(stat="identity") +
  labs(title = "Number of Companies in each Sector",
       x = "",
       y = "Sector")

Next, let’s choose some stocks and their ticker symbols and download some data. You MUST choose 6 different stocks from the ones listed below; You should, however, add SPY which is the SP500 ETF (Exchange Traded Fund).

# Notice the cache=TRUE argument inthe chunk options. Because getting data is time consuming, 
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd

myStocks <- c("EBR","BBL","AEE","BCE","BRO","CAT","BUD","SPY" ) %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2021-08-31") %>%
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 21,464
## Columns: 8
## Groups: symbol [8]
## $ symbol   <chr> "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR"…
## $ date     <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open     <dbl> 13.9, 14.1, 14.3, 14.3, 14.0, 13.8, 13.8, 13.8, 13.9, 14.0, 1…
## $ high     <dbl> 14.0, 14.3, 14.5, 14.3, 14.0, 13.9, 13.8, 13.9, 14.1, 14.3, 1…
## $ low      <dbl> 13.9, 14.0, 14.3, 14.0, 13.8, 13.7, 13.6, 13.7, 13.8, 13.9, 1…
## $ close    <dbl> 14.0, 14.2, 14.4, 14.1, 14.0, 13.7, 13.7, 13.8, 14.0, 14.2, 1…
## $ volume   <dbl> 878300, 802900, 950300, 855800, 586000, 1020300, 1545800, 112…
## $ adjusted <dbl> 9.85, 10.00, 10.13, 9.92, 9.82, 9.63, 9.65, 9.71, 9.84, 10.03…

We can see that the dataset is 8x21,464 tibble with each row being the ohlc (open,high,low,close) and volume for a stock on a given date.

Financial performance analysis depend on returns; If I buy a stock today for 100 and I sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns",
               cols = c(nested.col))  

#calculate monthly  returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns",
               cols = c(nested.col)) 

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns",
               cols = c(nested.col))

Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.

glimpse(myStocks_returns_monthly)
## Rows: 1,024
## Columns: 3
## Groups: symbol [8]
## $ symbol          <chr> "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR"…
## $ date            <date> 2011-01-31, 2011-02-28, 2011-03-31, 2011-04-29, 2011-…
## $ monthly_returns <dbl> -0.0264, 0.0487, 0.0861, -0.0445, -0.0263, -0.0264, -0…
monthlystocks_summarised <- myStocks_returns_monthly %>% 
  group_by(symbol) %>%
  summarise(min_return = min(monthly_returns),
            max_return = max(monthly_returns),
            median_return = median(monthly_returns),
            mean_return = mean(monthly_returns),
            sd_return = sd(monthly_returns))

Plot a density plot, using geom_density(), for each of the stocks

ggplot(myStocks_returns_monthly, aes(x = monthly_returns)) +
  geom_density() + 
  facet_wrap(~symbol) +
  labs(
    title = "Density Plots for Monthly Returns of Stocks",
    x = "Monthly Return",
    y = "Density"
  ) +
  NULL

What can you infer from this plot? Which stock is the riskiest? The least risky?

TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.

The monthly returns for the flatter density plots are more dispersed whereas those with tall peaks are more concentrated. The riskiest stock is EBR and the least risky is SPY (as an ETF) due to the shape of their peaks.

Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock

monthlystocks_summarised%>%
  ggplot(aes(y = mean_return, x=sd_return)) +
  geom_point() + 
  ggrepel::geom_text_repel(aes(label = symbol)) +
  NULL

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?

TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.

EBR is the most risky as it has the highest standard deviation of returns. EBR has the highest sd but also a high expected return. AEE, CAT and SPY also have high expected returns with less standard deviation. BBL and BUD have lower expected returns and higher standard deviations meaning they are riskier and do not have high expected returns. SPY produces a good return with lower risk.

4 On your own: IBM HR Analytics

For this task, you will analyse a data set on Human Resource Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.

First let us load the data

hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)
## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education are given as a number rather than a more useful description

hr_cleaned <- hr_dataset %>% 
  clean_names() %>% 
  mutate(
    education = case_when(
      education == 1 ~ "Below College",
      education == 2 ~ "College",
      education == 3 ~ "Bachelor",
      education == 4 ~ "Master",
      education == 5 ~ "Doctor"
    ),
    environment_satisfaction = case_when(
      environment_satisfaction == 1 ~ "Low",
      environment_satisfaction == 2 ~ "Medium",
      environment_satisfaction == 3 ~ "High",
      environment_satisfaction == 4 ~ "Very High"
    ),
    job_satisfaction = case_when(
      job_satisfaction == 1 ~ "Low",
      job_satisfaction == 2 ~ "Medium",
      job_satisfaction == 3 ~ "High",
      job_satisfaction == 4 ~ "Very High"
    ),
    performance_rating = case_when(
      performance_rating == 1 ~ "Low",
      performance_rating == 2 ~ "Good",
      performance_rating == 3 ~ "Excellent",
      performance_rating == 4 ~ "Outstanding"
    ),
    work_life_balance = case_when(
      work_life_balance == 1 ~ "Bad",
      work_life_balance == 2 ~ "Good",
      work_life_balance == 3 ~ "Better",
      work_life_balance == 4 ~ "Best"
    )
  ) %>% 
  select(age, attrition, daily_rate, department,
         distance_from_home, education,
         gender, job_role,environment_satisfaction,
         job_satisfaction, marital_status,
         monthly_income, num_companies_worked, percent_salary_hike,
         performance_rating, total_working_years,
         work_life_balance, years_at_company,
         years_since_last_promotion)
glimpse(hr_cleaned)
## Rows: 1,470
## Columns: 19
## $ age                        <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35,…
## $ attrition                  <chr> "Yes", "No", "Yes", "No", "No", "No", "No",…
## $ daily_rate                 <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 135…
## $ department                 <chr> "Sales", "Research & Development", "Researc…
## $ distance_from_home         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26…
## $ education                  <chr> "College", "Below College", "College", "Mas…
## $ gender                     <chr> "Female", "Male", "Male", "Female", "Male",…
## $ job_role                   <chr> "Sales Executive", "Research Scientist", "L…
## $ environment_satisfaction   <chr> "Medium", "High", "Very High", "Very High",…
## $ job_satisfaction           <chr> "Very High", "Medium", "High", "High", "Med…
## $ marital_status             <chr> "Single", "Married", "Single", "Married", "…
## $ monthly_income             <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2…
## $ num_companies_worked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5…
## $ percent_salary_hike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13,…
## $ performance_rating         <chr> "Excellent", "Outstanding", "Excellent", "E…
## $ total_working_years        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5,…
## $ work_life_balance          <chr> "Bad", "Better", "Better", "Better", "Bette…
## $ years_at_company           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, …
## $ years_since_last_promotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0…

Produce a one-page summary describing this dataset. Here is a non-exhaustive list of questions:

  1. How often do people leave the company (attrition)
# 1233 employees stay while 237 left (19.2%).
hr_cleaned %>% 
  group_by(attrition)%>% 
  summarise(count = count(attrition))
## # A tibble: 2 × 2
##   attrition count
##   <chr>     <int>
## 1 No         1233
## 2 Yes         237
# to see the how attrition rate changes with years at company
prop.table(table(hr_cleaned[,c("years_at_company","attrition")]),1)[,2]%>%
  barplot(main="Attrition Rate vs Years At Company")
abline(h=0.192,col="red") # avergae attrition rate

  • As we can see from the above table only around 20% of the employees in the dataset left the company during their working years. This shows that employees do not leave that often.
  1. How are age, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?
  • As we can see from the histograms from the summary statistics, the only variable that looks normally distributed is age.
skim(hr_cleaned)
Data summary
Name hr_cleaned
Number of rows 1470
Number of columns 19
_______________________
Column type frequency:
character 10
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
attrition 0 1 2 3 0 2 0
department 0 1 5 22 0 3 0
education 0 1 6 13 0 5 0
gender 0 1 4 6 0 2 0
job_role 0 1 7 25 0 9 0
environment_satisfaction 0 1 3 9 0 4 0
job_satisfaction 0 1 3 9 0 4 0
marital_status 0 1 6 8 0 3 0
performance_rating 0 1 9 11 0 2 0
work_life_balance 0 1 3 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 36.92 9.14 18 30 36 43 60 ▂▇▇▃▂
daily_rate 0 1 802.49 403.51 102 465 802 1157 1499 ▇▇▇▇▇
distance_from_home 0 1 9.19 8.11 1 2 7 14 29 ▇▅▂▂▂
monthly_income 0 1 6502.93 4707.96 1009 2911 4919 8379 19999 ▇▅▂▁▂
num_companies_worked 0 1 2.69 2.50 0 1 2 4 9 ▇▃▂▂▁
percent_salary_hike 0 1 15.21 3.66 11 12 14 18 25 ▇▅▃▂▁
total_working_years 0 1 11.28 7.78 0 6 10 15 40 ▇▇▂▁▁
years_at_company 0 1 7.01 6.13 0 3 5 9 40 ▇▂▁▁▁
years_since_last_promotion 0 1 2.19 3.22 0 0 1 3 15 ▇▁▁▁▁
  1. How are job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of total?
hr_cleaned %>%  
  group_by(job_satisfaction)%>% 
  summarise(countjs = n(),
            percentagejs = countjs/nrow(hr_cleaned)*100)
## # A tibble: 4 × 3
##   job_satisfaction countjs percentagejs
##   <chr>              <int>        <dbl>
## 1 High                 442         30.1
## 2 Low                  289         19.7
## 3 Medium               280         19.0
## 4 Very High            459         31.2
hr_cleaned %>%  
  group_by(work_life_balance)%>% 
  summarise(countwlb= n(),
            percentagewlb = round(countwlb/nrow(hr_cleaned)*100,2))
## # A tibble: 4 × 3
##   work_life_balance countwlb percentagewlb
##   <chr>                <int>         <dbl>
## 1 Bad                     80          5.44
## 2 Best                   153         10.4 
## 3 Better                 893         60.8 
## 4 Good                   344         23.4
  • Job satisfaction is distributed quite evenly however the categories of High and Very High are more common with around 30% each. In terms of work life balance very few people have Bad or the Best work life balance. The majority of people have better work life balance.
  1. Is there any relationship between monthly income and education? Monthly income and gender?
ggplot(hr_cleaned, aes(x = fct_relevel(education, 
            "Doctor", "Master", "Bachelor", 
            "College", "Below College"), y = monthly_income)) +
  geom_boxplot()+
  labs(title = "Boxplot of Monthly Income against Education",
       x = "Education",
       y = "Monthly Income")+
  NULL

ggplot(hr_cleaned, aes(x = monthly_income, y = gender)) +
  geom_boxplot()+
  labs(title = "Boxplot of Monthly Income against Gender",
       x = "Monthly Income",
       y = "Gender")+
  NULL

  • As we can see from the boxplot of males and females and their monthly income, females have a higher median monthly income which could mean that the females in the dataset could be more educated. The doctors have the highest monthly median income and those with below college education have the least. Doctors also have the highest variability in the IQR. College has the most outliers due to its low standard deviation.
  1. Plot a boxplot of income vs job role. Make sure the highest-paid job roles appear first.
ggplot(hr_cleaned, aes(x=fct_reorder(job_role,-monthly_income), y=monthly_income)) +
  geom_boxplot() +
  labs(title = "Boxplot of Monthly Income against Job Role",
       x = "Job Role",
       y = "Monthly Income")+
  NULL

  1. Calculate and plot a bar chart of the mean (or median?) income by education level.
hr_cleaned %>% 
  group_by(education) %>%
  summarise(medianinc = median(monthly_income),
            meaninc = mean(monthly_income)) %>% 
  ggplot(aes(x = fct_relevel(education, 
            "Doctor", "Master", "Bachelor", 
            "College", "Below College"),
            y=meaninc)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Monthly Income of Each Education Level",
       x = "Education",
       y = "Average Monthly Income")+
  NULL

  • As we can see the medians and means differ greatly, this is due to the outliers in the dataset with individuals with abnormally high monthly income levels.
  1. Plot the distribution of income by education level. Use a facet_wrap and a theme from ggthemes
hr_cleaned %>% 
  ggplot(aes(x=monthly_income)) +
  geom_histogram(bins=30)+
  facet_wrap(~fct_relevel(education, 
            "Doctor", "Master", "Bachelor", 
            "College", "Below College")) +
  theme_wsj() +
  NULL

  1. Plot income vs age, faceted by job_role
hr_cleaned %>% 
  ggplot(aes(y=monthly_income, x=age)) +
  geom_point() +
  geom_smooth(method="lm")+
  facet_wrap(~job_role) +
  theme_wsj() +
  NULL

5 Challenge 1: Replicating a chart

The purpose of this exercise is to reproduce a plot using your dplyr and ggplot2 skills. Read the article The Racial Factor: There’s 77 Counties Which Are Deep Blue But Also Low-Vaxx. Guess What They Have In Common? and have a look at the attached figure.

You dont have to worry about the blue-red backgound and don’t worry about replicating it exactly, try and see how far you can get. You’re encouraged to work together if you want to and exchange tips/tricks you figured out– and even though the figure in the original article is from early July 2021, you can use the most recent data.

Some hints to get you started:

  1. To get vaccination by county, we will use data from the CDC
  2. You need to get County Presidential Election Returns 2000-2020
  3. Finally, you also need an estimate of the population of each county
# 3154 unique fips code in election2020_results
length(unique(election2020_results$fips))
## [1] 3154
# check unique values of candidate names
unique(election2020_results$candidate)
## [1] "JOSEPH R BIDEN JR" "OTHER"             "DONALD J TRUMP"   
## [4] "JO JORGENSEN"
data <- election2020_results %>% 
  mutate(votes_percentage = candidatevotes/totalvotes) %>% # calculate percentage
  filter(candidate=="DONALD J TRUMP", mode=="TOTAL") %>% # we only need Trump votes 
  inner_join(vaccinations,by = "fips") %>% # inner join with vaccinations data
  inner_join(population,by = "fips") # inner join with population data


# generate graph below
# install.packages("ggpubr") # to show equation easily
library(ggpubr)
# calculate actual vaccination percentage
actual = sum(data$series_complete_yes)/sum(data$pop_estimate_2019)
data %>% 
  # filter out 0% vaccinated points
  filter(series_complete_pop_pct>0) %>%
  ggplot() +
  geom_point(aes(x=votes_percentage,y=series_complete_pop_pct/100,size=pop_estimate_2019),shape=21,fill="light blue")+
  
  # scale circle size
  scale_size(range = c(0, 20)) + 
  
  # add points in the center of circles
  geom_point(aes(x=votes_percentage,y=series_complete_pop_pct/100),size=0.5)+ 
  
  # add regression line
  geom_smooth(aes(x=votes_percentage,y=series_complete_pop_pct/100),
              method="lm",linetype="dotted",colour="blue",se=FALSE)+ 
  
  # add the equation of regression line
  stat_regline_equation(aes(x=votes_percentage,y=series_complete_pop_pct/100),
                        label.y = 0.1,colour="red",fontface = "bold.italic") + 
  
  # add r square
  stat_cor(aes(x=votes_percentage,y=series_complete_pop_pct/100,label = paste(..rr.label..)),
           label.y = 0.05,colour="red")+ 
  
  # add date
  geom_text(aes(x=0.3, y=0.07,label = "09/03/2021", 
                fontface = "bold.italic"),colour="red")+ 
  
  # add horizonal lines below
  geom_hline(aes(yintercept=0.85), linetype=2) + # herd immunity line
  geom_text(aes(x=0, y=0.85, label = "Herd Immunity threshold (?)", 
                vjust=-1, hjust=0, fontface = "bold.italic"),colour="blue") +
  geom_hline(aes(yintercept=0.539), linetype=2) + # Target line
  geom_text(aes(x=0, y=0.539,label = "TARGET: 53.9%", 
                vjust=-1,hjust=0, fontface = "bold.italic"),colour="blue") +
  geom_hline(aes(yintercept=actual), linetype=2) + # actual line
  geom_text(aes(x=0, y=actual, label = paste("ACTUAL: ", round(actual*100,1),"%"),
                vjust=-1,hjust=0,fontface = "bold.italic"),
            colour="blue") +
  
  # adjuct grid lines
  scale_x_continuous(label=scales::percent_format(accuracy = 1),
                     breaks=seq(0,1,0.05),limits = c(0,1))+
  scale_y_continuous(label=scales::percent_format(accuracy = 1),
                     breaks=seq(0,1,0.05),limits = c(0,1))+
  geom_text(aes(x=0.5,y=1, label = "EACH U.S. COUNTY",vjust=0.5,hjust=0.5,
                family="mono",fontface = "bold"),
            size=5)+
  labs(title = "COVID-19 VACCINATION LEVELS OUT OF TOTAL POPULATION BY COUNTY",
       x = "2020 Trump Vote %",
       y = "% of Total Population Vaccinated")+
  # remove the legend
  theme(legend.position = "none")+
  NULL

6 Challenge 2: Opinion polls for the 2021 German elections

The Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia and your task is to reproduce the graph similar to the one produced by the Guardian.

The following code will scrape the wikipedia page and import the table in a dataframe.

url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor


# get tables that exist on wikipedia page 
tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")


# parse HTML tables into a dataframe called polls 
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>% 
             html_table(fill=TRUE)%>% 
             janitor::clean_names())


# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
  slice(2:(n()-1)) %>%  # drop the first row, as it contains again the variable names and last row that contains 2017 results
  mutate(
         # polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
         # and we extract it by picking the last 11 characters from that field
         end_date = str_sub(fieldwork_date, -11),
         
         # end_date is still a string, so we convert it into a date object using lubridate::dmy()
         end_date = dmy(end_date),
         
         # we also get the month and week number from the date, if we want to do analysis by month- week, etc.
         month = month(end_date),
         week = isoweek(end_date)
         )

german_election_polls %>% 
  ggplot()+
  # union
  geom_point(aes(x=end_date,y=union),colour="black",shape=21)+
  # span=0.2 to make the line less smoothed
  geom_smooth(aes(x=end_date,y=union),se=F,colour="black",span = 0.2)+ 
  # spd
  geom_point(aes(x=end_date,y=spd),colour="red",shape=21)+
  geom_smooth(aes(x=end_date,y=spd),se=F,colour="red",span = 0.2)+
  # afd
  geom_point(aes(x=end_date,y=af_d),colour="blue",shape=21)+
  geom_smooth(aes(x=end_date,y=af_d),se=F,colour="blue",span = 0.2)+
  #fdp
  geom_point(aes(x=end_date,y=fdp),colour="yellow",shape=21)+
  geom_smooth(aes(x=end_date,y=fdp),se=F,colour="yellow",span = 0.2)+
  #grune
  geom_point(aes(x=end_date,y=grune),colour="dark green",shape=21)+
  geom_smooth(aes(x=end_date,y=grune),se=F,colour="dark green",span = 0.2)+
  #linke
  geom_point(aes(x=end_date,y=linke),colour="purple",shape=21)+
  geom_smooth(aes(x=end_date,y=linke),se=F,colour="purple",span = 0.2)+
  #display every month
  scale_x_date(date_labels="%b %y",date_breaks  ="1 month")+
  labs(
    x="Date",
    y="Votes %"
  )+
  NULL

7 Deliverables

There is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

8 Details

  • Who did you collaborate with: Alex Kubbinga, Clara, Jean Huang, Raghav Mehta, Raina Doshi, Yuan Gao
  • Approximately how much time did you spend on this problem set: 8 hours
  • What, if anything, gave you the most trouble: ANSWER HERE

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

9 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.